PCA Reduction and K-Means Clustering
1 Spotify Clustering
In this report, I will apply PCA to this Spotify dataset and then cluster the songs based on their characteristics.
2 Structure of this report
- Read data and basic pre-processing
- PCA reduction:
- Insights on summary and plot of PCA
- Uses of PCA
- Return as dataframe
- Clustering using K-Means:
- Finding optimum K using elbow method
- Clustering and evaluation of cluster
- Tuning cluster
- Purpose of clustering:
- Cluster Profiling
- Song Recommendation
3 Read data and basic pre-processing
I will take only the first 10000 rows, because the full dataset requires more computation than my laptop can handle well.
library(tidyverse)
spotify <- read.csv("SpotifyFeatures.csv", row.names=NULL)
spotify10000 <- head(spotify, 10000)
spotify_clean <- spotify10000 %>%
mutate_if(is.character, as.factor) %>%
mutate(track_name = as.character(track_name)) %>%
select(-track_id)
spotify_number <- spotify_clean %>%
select_if(is.numeric)
4 PCA reduction
library(FactoMineR)
spotify_pca <- PCA(spotify_number, scale.unit = TRUE, graph = FALSE)
spotify_pca2 <- prcomp(spotify_number, scale. = TRUE)
4.1 Insights on summary and plot of PCA
summary(spotify_pca)
##
## Call:
## PCA(X = spotify_number, scale.unit = TRUE, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
## Variance 2.711 1.526 1.191 1.120 0.985 0.930 0.850
## % of var. 24.650 13.871 10.825 10.177 8.956 8.453 7.727
## Cumulative % of var. 24.650 38.521 49.347 59.524 68.480 76.934 84.660
## Dim.8 Dim.9 Dim.10 Dim.11
## Variance 0.686 0.471 0.389 0.141
## % of var. 6.237 4.282 3.540 1.281
## Cumulative % of var. 90.897 95.180 98.719 100.000
##
## Individuals (the 10 first)
## Dist Dim.1 ctr cos2 Dim.2 ctr cos2
## 1 | 4.929 | 0.594 0.001 0.014 | 0.167 0.000 0.001 |
## 2 | 4.028 | 0.160 0.000 0.002 | 1.078 0.008 0.072 |
## 3 | 5.160 | -4.688 0.081 0.825 | 0.890 0.005 0.030 |
## 4 | 5.242 | -3.177 0.037 0.367 | -2.013 0.027 0.148 |
## 5 | 6.406 | -5.242 0.101 0.670 | -0.783 0.004 0.015 |
## 6 | 5.267 | -4.714 0.082 0.801 | 0.742 0.004 0.020 |
## 7 | 10.408 | -3.134 0.036 0.091 | 2.932 0.056 0.079 |
## 8 | 4.124 | -3.388 0.042 0.675 | -0.861 0.005 0.044 |
## 9 | 3.905 | -0.803 0.002 0.042 | 1.654 0.018 0.179 |
## 10 | 3.095 | -0.407 0.001 0.017 | 0.923 0.006 0.089 |
## Dim.3 ctr cos2
## 1 1.208 0.012 0.060 |
## 2 0.751 0.005 0.035 |
## 3 -0.162 0.000 0.001 |
## 4 0.151 0.000 0.001 |
## 5 0.195 0.000 0.001 |
## 6 0.530 0.002 0.010 |
## 7 6.161 0.319 0.350 |
## 8 -0.135 0.000 0.001 |
## 9 0.211 0.000 0.003 |
## 10 0.837 0.006 0.073 |
##
## Variables (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3 ctr
## popularity | 0.464 7.956 0.216 | -0.051 0.172 0.003 | -0.284 6.756
## acousticness | -0.839 25.990 0.705 | 0.061 0.244 0.004 | 0.054 0.243
## danceability | -0.027 0.026 0.001 | 0.829 45.092 0.688 | -0.038 0.121
## duration_ms | -0.042 0.064 0.002 | -0.382 9.572 0.146 | 0.412 14.256
## energy | 0.923 31.432 0.852 | -0.013 0.011 0.000 | 0.090 0.674
## instrumentalness | -0.103 0.391 0.011 | -0.247 3.987 0.061 | -0.129 1.397
## liveness | 0.132 0.638 0.017 | -0.066 0.288 0.004 | 0.644 34.806
## loudness | 0.865 27.578 0.748 | -0.050 0.163 0.002 | -0.056 0.265
## speechiness | -0.010 0.004 0.000 | 0.196 2.512 0.038 | 0.682 39.055
## tempo | 0.289 3.084 0.084 | -0.264 4.575 0.070 | 0.127 1.359
## cos2
## popularity 0.080 |
## acousticness 0.003 |
## danceability 0.001 |
## duration_ms 0.170 |
## energy 0.008 |
## instrumentalness 0.017 |
## liveness 0.414 |
## loudness 0.003 |
## speechiness 0.465 |
## tempo 0.016 |
spotify_pca2$rotation
## PC1 PC2 PC3 PC4 PC5
## popularity 0.282070151 -0.04142711 0.25991994 0.480946305 -0.168524430
## acousticness -0.509804020 0.04939635 -0.04926554 -0.118826841 -0.052295865
## danceability -0.016198945 0.67150434 0.03478232 0.255147015 0.045055855
## duration_ms -0.025270302 -0.30938238 -0.37757534 0.418864601 0.013814015
## energy 0.560640684 -0.01035973 -0.08207478 -0.038132956 0.097616839
## instrumentalness -0.062556989 -0.19966894 0.11817840 0.241592532 0.890209988
## liveness 0.079904988 -0.05368054 -0.58996954 -0.153789724 0.188196360
## loudness 0.525150228 -0.04034814 0.05151931 0.004540255 -0.088889536
## speechiness -0.006331369 0.15849523 -0.62494273 0.305035466 -0.153970017
## tempo 0.175624678 -0.21389069 -0.11658289 -0.529704217 0.005597191
## valence 0.168381294 0.57780076 -0.10332422 -0.238531173 0.312383767
## PC6 PC7 PC8 PC9 PC10
## popularity -0.02950508 0.45021448 0.5226917 0.33622394 0.032765622
## acousticness -0.00396675 0.07107606 0.1129885 0.25154508 0.743901950
## danceability 0.08446596 -0.03578955 0.2431472 -0.61647953 0.109700117
## duration_ms 0.29685305 -0.62279352 0.3261231 0.03104665 0.046051677
## energy -0.02343963 -0.09631607 -0.2064706 0.07960324 0.092039521
## instrumentalness 0.10781974 0.21153464 -0.1092217 -0.07509894 0.121758440
## liveness -0.64727160 0.15294677 0.3553483 -0.11327919 -0.008070191
## loudness -0.08073225 -0.12337475 -0.1746696 -0.13406458 0.634227590
## speechiness 0.30422239 0.43483367 -0.4140578 0.08768901 0.006880046
## tempo 0.58870694 0.28298568 0.3774842 -0.24149273 0.041327337
## valence 0.16455283 -0.20323262 0.1525471 0.57782134 -0.063403402
## PC11
## popularity -0.023698688
## acousticness -0.289618907
## danceability -0.144438889
## duration_ms -0.002832397
## energy -0.774978055
## instrumentalness 0.051436298
## liveness 0.045579018
## loudness 0.489434990
## speechiness 0.081977133
## tempo -0.004415781
## valence 0.207576879
spotify_pca2$sdev
## [1] 1.6466586 1.2352536 1.0912351 1.0580680 0.9925793 0.9642913 0.9219191
## [8] 0.8282730 0.6863430 0.6240054 0.3753269
Based on the summary above, we can tell that to retain at least 80% of the total variance (i.e. at most 20% loss), we need at least 7 PCs (PC1 + PC2 + ... + PC7), which together explain 84.66%.
We can also tell that the columns have different loadings on each PC. Looking at the rotation matrix, energy has the largest loading on PC1 (0.560640684), whereas danceability has the largest loading on PC2 (0.67150434), and so on.
We are also able to read off the eigenvalue of each PC. PC1 has the highest eigenvalue (2.711), PC2 the second highest (1.526), followed by PC3 (1.191), and so on. The eigenvalue directly corresponds to the amount of variance a PC captures: since PC1 has the highest eigenvalue, it also explains the largest share of the variance (24.650%), followed by PC2 (13.871%) and PC3 (10.825%).
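The link between the eigenvalues and the percentages of variance can be checked directly from the prcomp object, since each eigenvalue is the square of the corresponding PC standard deviation. A minimal sketch, using the same objects as above:

```r
# Eigenvalues are the squared standard deviations of the PCs
eigenvalues <- spotify_pca2$sdev^2

# Proportion and cumulative proportion of variance explained
prop_var <- eigenvalues / sum(eigenvalues)
cum_var  <- cumsum(prop_var)

round(100 * prop_var, 3)  # matches the "% of var." row in the summary
which(cum_var >= 0.80)[1] # smallest number of PCs retaining >= 80%
```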
plot.PCA(x = spotify_pca, choix = c("ind"), select = "contrib7", habillage = "ind")
Based on the plot above, we can tell that there are several outliers, such as the data points with indices 343, 97, 451, 471, 284, 133 and 6305. We can further analyse whether each outlier affects PC1 or PC2 more. For example, the points with indices 6305 and 343 affect PC1 more than PC2 (as seen from their position and the scale of both axes), whereas the point with index 133 affects PC2 more than PC1.
Next, we can analyse the effect of each column on the PCs.
plot.PCA(spotify_pca, cex = 0.6, choix = c("var"))
a <- dimdesc(spotify_pca)
as.data.frame(a[[1]]$quanti) # correlation to PC1
as.data.frame(a[[2]]$quanti) # correlation to PC2
Based on the plot and dataframes above, acousticness, danceability and valence affect PC2 more than PC1, whereas energy, loudness, popularity and tempo affect PC1 more than PC2. From here, we can also tell the collinearity between columns: energy and loudness have very high positive collinearity, whereas popularity and acousticness have very high negative collinearity. This can be seen from the relative position and direction of each column's arrow on the plot.
4.2 Uses of PCA
Besides reducing the dimensionality of the data without much loss of information, PCA can be used to tackle the no-multicollinearity assumption required of predictors in a linear regression model. This is because after PCA, the resulting columns (the PC scores) are no longer collinear with each other.
Below is an example:
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
ggcorr(spotify_number, label = T)
We can see from the plot above that several columns have very high collinearity with each other. For example, loudness and energy have very high positive collinearity (0.8), and energy and acousticness have very high negative collinearity (-0.7). If we were to build a linear regression model from this dataset, we might not be able to fulfill the assumption of no multicollinearity between predictors.
Hence, one way to solve this issue would be to do PCA on the data and use that result instead of the original numerical values.
ggcorr(data.frame(spotify_pca2$x), label = T)
As we can see, there is zero correlation between the PCs, so they would fulfill the no-multicollinearity assumption if used in a linear regression model. However, there is a caveat: once the data is turned into PCs, the individual numbers are no longer interpretable, so use this method sparingly and appropriately.
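To illustrate this use, a linear model can be fitted on the PC scores instead of the raw features. This is only a sketch: the choice of popularity as the response and of the first 7 PCs as predictors is an assumption for illustration, not part of the analysis above.

```r
# Hypothetical example: principal-component regression of popularity.
# PCA is rerun on the predictors only (popularity excluded), then a
# linear model is fitted on the first 7 PC scores.
predictors <- spotify_number %>% select(-popularity)
pred_pca   <- prcomp(predictors, scale. = TRUE)

pcr_data <- data.frame(
  popularity = spotify_number$popularity,
  pred_pca$x[, 1:7]
)

pcr_model <- lm(popularity ~ ., data = pcr_data)
summary(pcr_model) # the PC predictors are uncorrelated by construction
```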
4.3 Return as dataframe
Since we have decided to accept a 20% loss of variance, I will keep only the first 7 PCs and bind them into a dataframe together with the non-numerical columns.
pca_keep <- spotify_pca2$x[,c(1:7)] %>%
as.data.frame()
spotify_final <- spotify_clean %>%
select_if(negate(is.numeric)) %>%
bind_cols(pca_keep)
head(spotify_final)
5 Clustering using K-Means
We will now cluster the songs based on their characteristics into several clusters.
summary(spotify_clean)
## genre artist_name track_name
## A Capella : 119 Chorus : 102 Length:10000
## Alternative:5054 Henri Salvador : 88 Class :character
## Country :4162 George Strait : 68 Mode :character
## Dance : 113 Five Finger Death Punch: 63
## Movie : 408 Linkin Park : 61
## R&B : 144 Kenny Chesney : 60
## (Other) :9558
## popularity acousticness danceability duration_ms
## Min. : 0.00 Min. :0.0000014 Min. :0.0617 Min. : 18800
## 1st Qu.: 39.00 1st Qu.:0.0126000 1st Qu.:0.4710 1st Qu.: 189877
## Median : 47.00 Median :0.1120000 Median :0.5620 Median : 216760
## Mean : 45.97 Mean :0.2453824 Mean :0.5598 Mean : 226406
## 3rd Qu.: 54.00 3rd Qu.:0.4222500 3rd Qu.:0.6540 3rd Qu.: 249410
## Max. :100.00 Max. :0.9950000 Max. :0.9710 Max. :3631469
##
## energy instrumentalness key liveness
## Min. :0.00154 Min. :0.0000000 G :1230 Min. :0.0214
## 1st Qu.:0.49075 1st Qu.:0.0000000 D :1179 1st Qu.:0.0985
## Median :0.68300 Median :0.0000073 C :1122 Median :0.1310
## Mean :0.65174 Mean :0.0335583 A :1009 Mean :0.1945
## 3rd Qu.:0.83600 3rd Qu.:0.0007252 C# : 891 3rd Qu.:0.2460
## Max. :0.99800 Max. :0.9840000 E : 874 Max. :0.9960
## (Other):3695
## loudness mode speechiness tempo
## Min. :-29.368 Major:7381 Min. :0.02230 Min. : 32.24
## 1st Qu.: -8.896 Minor:2619 1st Qu.:0.03200 1st Qu.: 96.84
## Median : -6.496 Median :0.04190 Median :120.00
## Mean : -7.277 Mean :0.07557 Mean :121.72
## 3rd Qu.: -4.873 3rd Qu.:0.07220 3rd Qu.:142.73
## Max. : -0.259 Max. :0.96100 Max. :216.03
##
## time_signature valence
## 1/4: 47 Min. :0.0000
## 3/4: 700 1st Qu.:0.3130
## 4/4:9148 Median :0.4770
## 5/4: 105 Mean :0.4893
## 3rd Qu.:0.6650
## Max. :0.9830
##
Since the ranges of the columns vary a lot (the max of duration_ms is 3631469 while the max of valence is 0.9830), we need to scale the data.
spotify_clusternew <- spotify_clean %>%
select_if(is.numeric) %>%
scale() %>%
as.data.frame()
5.1 Finding optimum K using elbow method
library(factoextra)
fviz_nbclust(spotify_clusternew, kmeans, method = "wss")
Based on the graph above, the most suitable k is 5, because the drop in the total within sum of squares from 5 to 6 clusters is very small. The lower the within sum of squares, the tighter each cluster is to its centroid.
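The elbow curve can also be reproduced by hand, which makes explicit what fviz_nbclust computes: the total within sum of squares of a k-means fit for each candidate k. A minimal sketch:

```r
set.seed(100)
# Total within sum of squares for k = 1..10 (what the elbow plot shows)
wss <- sapply(1:10, function(k) {
  kmeans(spotify_clusternew, centers = k)$tot.withinss
})
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within sum of squares")
```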
5.2 Clustering and evaluation of cluster
set.seed(100)
spotify_kmeans <- kmeans(spotify_clusternew, centers = 5)
fviz_cluster(spotify_kmeans, spotify_clusternew, ggtheme = theme_minimal())
spotify_kmeans$betweenss / spotify_kmeans$totss
## [1] 0.3207211
According to the computation above, this clustering is still not very good: the ratio of its between sum of squares (the total distance from each centroid to the center of the whole data) to its total sum of squares (the total distance from each data point to the center of the whole data) is low, and the closer this ratio is to 1, the better.
5.3 Tuning cluster
In order to get a more favourable outcome, we will try to change the number of clusters.
We will try 8 clusters, since in the elbow plot the drop in total within sum of squares from 8 to 9 clusters is also very small.
set.seed(100)
spotify_kmeans2 <- kmeans(spotify_clusternew, centers = 8)
fviz_cluster(spotify_kmeans2, spotify_clusternew, ggtheme = theme_minimal())
As we can see from the diagram above, the clusters overlap a lot. This is not ideal, so we can also try a smaller number of clusters.
set.seed(100)
spotify_kmeans3 <- kmeans(spotify_clusternew, centers = 3)
fviz_cluster(spotify_kmeans3, spotify_clusternew, ggtheme = theme_minimal())
We now evaluate the two new clustering models against the original one.
spotify_kmeans$tot.withinss
## [1] 74713.21
spotify_kmeans2$tot.withinss
## [1] 57609.36
spotify_kmeans3$tot.withinss
## [1] 85106.21
A good model has a small total within sum of squares. Since the total within sum of squares measures the total distance between each data point and its cluster's centroid, the smaller the value, the tighter the clusters, making the model better at separating different songs.
From this, we can see that the model with the smallest total within sum of squares is the one with 8 clusters.
spotify_kmeans$betweenss
## [1] 35275.79
spotify_kmeans2$betweenss
## [1] 52379.64
spotify_kmeans3$betweenss
## [1] 24882.79
A good model has a large between sum of squares. Since the between sum of squares measures the total distance between each cluster's centroid and the center of the data, the larger the value, the more distinct the clusters are from one another.
From this, we can see that the model with the largest between sum of squares is the one with 8 clusters.
spotify_kmeans$betweenss / spotify_kmeans$totss
## [1] 0.3207211
spotify_kmeans2$betweenss / spotify_kmeans2$totss
## [1] 0.4762262
spotify_kmeans3$betweenss / spotify_kmeans3$totss
## [1] 0.2262298
From this, we can see that the model with 8 clusters has the ratio of between sum of squares to total sum of squares closest to 1.
Across all three criteria, the best model is the one with 8 clusters, so we will move forward with that model.
6 Purpose of clustering
6.1 Cluster profiling
spotify_clusternew %>%
mutate(cluster = as.factor(spotify_kmeans2$cluster)) %>%
group_by(cluster) %>%
summarise_all(mean) %>%
pivot_longer(cols = -c(1), names_to = "type", values_to = "value") %>% #column besides cluster is transformed
ggplot(aes(x = cluster, y = value, fill = cluster)) +
geom_col() +
facet_wrap(~type) +
theme_minimal()
I have broken down the clusters by their characteristics so that we can better visualise the differences between them.
From this, we can see that cluster 6 songs have a very long average duration compared to songs in other clusters. Cluster 6 also has the highest average speechiness score and the lowest average popularity score. On the other hand, cluster 1 songs have the highest average liveness score. Another interesting observation is that cluster 4 and 6 songs have very similar acousticness, energy, loudness and popularity scores, though they differ greatly in duration and danceability.
6.2 Song recommendation
spotify_clusternew %>%
mutate(cluster = as.factor(spotify_kmeans2$cluster)) %>%
mutate(track = as.factor(spotify_clean$track_name)) %>%
group_by(cluster) %>%
arrange(cluster) %>%
filter(cluster == "1") %>%
select_if(negate(is.numeric))
For example, if a Spotify user likes the song "But pour Rudy" a lot, Spotify would be able to recommend songs in the same cluster as "But pour Rudy", such as "Flawless Remix" or "Remember Me (Dúo)".
Similarly, if a user likes a song in another cluster, Spotify could use the same approach to recommend songs that suit the user's taste (songs in the same cluster as the user's favourite song).
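The lookup above can be wrapped in a small helper. This is a hypothetical convenience function (the name recommend_songs and the n parameter are illustrative, not part of the analysis above); it simply returns other tracks from the same cluster as a given track:

```r
# Hypothetical helper: recommend other songs from the same cluster.
# Assumes spotify_clean and spotify_kmeans2 exist as built above.
recommend_songs <- function(track, n = 5) {
  clusters <- spotify_kmeans2$cluster
  # Cluster(s) containing the requested track
  target <- unique(clusters[spotify_clean$track_name == track])
  if (length(target) == 0) stop("Track not found")
  # Other tracks assigned to the same cluster(s)
  candidates <- spotify_clean$track_name[clusters %in% target]
  candidates <- setdiff(unique(candidates), track)
  head(candidates, n)
}

recommend_songs("But pour Rudy")
```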